consistency measure
Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs
Hamman, Faisal, Dissanayake, Pasan, Mishra, Saumitra, Lecue, Freddy, Dutta, Sanghamitra
Fine-tuning large language models (LLMs) on limited tabular data for classification tasks can lead to \textit{fine-tuning multiplicity}, where equally well-performing models make conflicting predictions on the same inputs due to variations in the training process (e.g., seed, random weight initialization, retraining on additional or deleted samples). This raises critical concerns about the robustness and reliability of Tabular LLMs, particularly when deployed for high-stakes decision-making in domains such as finance, hiring, education, and healthcare. This work formalizes the challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel metric to quantify the robustness of individual predictions without expensive model retraining. Our metric quantifies a prediction's stability by sampling the model's local behavior around the input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic robustness guarantees against a broad class of fine-tuned models. By leveraging Bernstein's Inequality, we show that predictions with sufficiently high robustness (as defined by our measure) will remain consistent with high probability. We also provide empirical evaluation on real-world datasets to support our theoretical results. Our work highlights the importance of addressing fine-tuning instabilities to enable trustworthy deployment of LLMs in high-stakes and safety-critical applications.
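The idea of scoring a prediction by sampling the model's behavior around the input can be illustrated with a small sketch. The abstract does not give the formula, so the score below (mean neighbour confidence minus mean absolute deviation from the confidence at the input) is an assumption made for illustration, as are the names `local_consistency`, `predict_proba`, `k`, and `sigma`. The toy one-dimensional "embedding" stands in for a real model's embedding space.

```python
import math
import random
from statistics import fmean
from typing import Callable, Sequence


def local_consistency(
    predict_proba: Callable[[Sequence[float]], float],
    x: Sequence[float],
    k: int = 200,
    sigma: float = 0.1,
    seed: int = 0,
) -> float:
    """Hypothetical stability score for the prediction at x.

    Samples k Gaussian-perturbed neighbours of x and returns the mean
    confidence over the neighbours minus the mean absolute deviation of
    those confidences from the confidence at x itself. This combination
    is an assumption for illustration, not the paper's stated formula.
    """
    rng = random.Random(seed)
    fx = predict_proba(x)
    confidences = []
    deviations = []
    for _ in range(k):
        neighbour = [xi + rng.gauss(0.0, sigma) for xi in x]
        fn = predict_proba(neighbour)
        confidences.append(fn)
        deviations.append(abs(fn - fx))
    return fmean(confidences) - fmean(deviations)


def toy_model(x: Sequence[float]) -> float:
    """Toy confidence function: a steep logistic curve in one dimension."""
    return 1.0 / (1.0 + math.exp(-5.0 * x[0]))


if __name__ == "__main__":
    print("far from boundary :", local_consistency(toy_model, [2.0]))
    print("near the boundary :", local_consistency(toy_model, [0.0]))
```

Under these assumptions, an input far from the decision boundary receives a higher score than one sitting on it, which matches the intuition that locally stable, confident predictions are less likely to flip across fine-tuned variants.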
Measuring Reliability of Large Language Models through Semantic Consistency
Raj, Harsh, Rosati, Domenic, Majumdar, Subhabrata
While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are very sensitive to what prompts are fed into them. Even when prompts are semantically identical, language models may give very different answers. When considering safe and trustworthy deployments of PLMs we would like their outputs to be consistent under prompts that mean the same thing or convey the same intent. While some work has looked into how state-of-the-art PLMs address this need, they have been limited to only evaluating lexical equality of single- or multi-word answers and do not address consistency of generative text sequences. In order to understand consistency of PLMs under text generation settings, we develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions in the TruthfulQA dataset. We find that our proposed metrics are considerably more consistent than traditional metrics embodying lexical consistency, and also correlate with human evaluation of output consistency to a higher degree.
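One way to picture a consistency measure over open-ended outputs is as the average agreement between every pair of answers a model gives to paraphrases of the same question. The sketch below assumes that pairwise-average form; the abstract does not state the exact definition. The function names are hypothetical, and `token_jaccard` is a purely lexical stand-in included only so the example runs without external models. A semantic agreement function (for instance, one based on entailment or paraphrase detection) would be passed as `agree` instead.

```python
from itertools import combinations
from typing import Callable, Sequence


def token_jaccard(a: str, b: str) -> float:
    """Lexical placeholder for an agreement function: Jaccard overlap of tokens."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pairwise_consistency(
    outputs: Sequence[str],
    agree: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Mean agreement over all unordered pairs of outputs (assumed form)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output is trivially self-consistent
    return sum(agree(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    answers = [
        "The heart pumps blood through the body",
        "The heart pumps blood around the body",
        "It digests food",
    ]
    print(pairwise_consistency(answers))
```

Keeping the agreement function pluggable makes the contrast in the abstract concrete: the aggregation stays the same while a lexical `agree` is swapped for a semantic one.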
ST-CoNAL: Consistency-Based Acquisition Criterion Using Temporal Self-Ensemble for Active Learning
Baik, Jae Soon, Yoon, In Young, Choi, Jun Won
Modern deep learning has achieved great success in various fields. However, it requires the labeling of huge amounts of data, which is expensive and labor-intensive. Active learning (AL), which identifies the most informative samples to be labeled, is becoming increasingly important to maximize the efficiency of the training process. The existing AL methods mostly use only a single final fixed model for acquiring the samples to be labeled. This strategy may not be good enough in that the structural uncertainty of a model for given training data is not considered to acquire the samples. In this study, we propose a novel acquisition criterion based on temporal self-ensemble generated by conventional stochastic gradient descent (SGD) optimization. These self-ensemble models are obtained by capturing the intermediate network weights obtained through SGD iterations. Our acquisition function relies on a consistency measure between the student and teacher models. The student models are given a fixed number of temporal self-ensemble models, and the teacher model is constructed by averaging the weights of the student models. Using the proposed acquisition criterion, we present an AL algorithm, namely student-teacher consistency-based AL (ST-CoNAL). Experiments conducted for image classification tasks on CIFAR-10, CIFAR-100, Caltech-256, and Tiny ImageNet datasets demonstrate that the proposed ST-CoNAL achieves significantly better performance than the existing acquisition methods. Furthermore, extensive experiments show the robustness and effectiveness of our methods.
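The student-teacher construction described above can be sketched for a toy linear softmax classifier. The abstract says the teacher is built by averaging the student weights and that acquisition relies on a consistency measure between students and teacher, but it does not name the divergence; the use of KL divergence here is an assumption, and all function names are hypothetical. Weights are plain nested lists (one row per class) to keep the example dependency-free.

```python
import math
from typing import List, Sequence

Weights = List[List[float]]  # one row of feature weights per class


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights: Weights, x: Sequence[float]) -> List[float]:
    """Class probabilities of a linear softmax model."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])


def average_weights(students: Sequence[Weights]) -> Weights:
    """Teacher weights: element-wise mean of the student snapshots."""
    n = len(students)
    return [
        [sum(s[c][j] for s in students) / n for j in range(len(students[0][c]))]
        for c in range(len(students[0]))
    ]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon for numerical safety."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def acquisition_score(students: Sequence[Weights], x: Sequence[float]) -> float:
    """Mean teacher-student divergence at x (assumed inconsistency score).

    Higher values mean the snapshots disagree more with their average,
    so x would be a stronger candidate for labeling.
    """
    teacher_probs = predict(average_weights(students), x)
    return sum(kl_divergence(teacher_probs, predict(s, x)) for s in students) / len(students)


if __name__ == "__main__":
    snapshots = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
    print(acquisition_score(snapshots, [1.0, 0.0]))
```

In an active-learning loop under these assumptions, unlabeled samples would be ranked by this score and the highest-scoring ones sent for labeling.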
Consistency Measures for Feature Selection: A Formal Definition, Relative Sensitivity Comparison and a Fast Algorithm
Shin, Kilho (University of Hyogo) | Fernandes, Danny (University of Hyogo) | Miyazaki, Seiya (Panasonic Corporation)
Consistency-based feature selection is an important category of feature selection research yet is defined only intuitively in the literature. First, we formally define a consistency measure, and then using this definition, evaluate 19 feature selection measures from the literature. While only 5 of these were labeled as consistency measures by their original authors, by our definition, an additional 9 measures should be classified as consistency measures. To compare these 14 consistency measures in terms of sensitivity, we introduce the concept of quasilinear compatibility order, and partially determine the order among the measures. Next, we propose a new fast algorithm for consistency-based feature selection. We ran experiments using eleven large datasets to compare the performance of our algorithm against INTERACT and LCC, the only two instances of consistency-based algorithms with potential real world application. Our algorithm shows vast improvement in time efficiency, while its performance in accuracy is comparable with that of INTERACT and LCC.
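As a concrete instance of what a consistency measure for feature selection computes, the sketch below implements the classical inconsistency rate: project the data onto a candidate feature subset, group rows with identical projected values, and count the rows in each group that do not belong to the group's majority class. This is offered as one well-known example of the family; it is not claimed to be the formal definition or the fast algorithm proposed in the paper, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def inconsistency_rate(
    rows: Sequence[Sequence[Hashable]],
    labels: Sequence[Hashable],
    features: Sequence[int],
) -> float:
    """Fraction of rows that clash with the majority label of their pattern.

    rows     -- dataset, one sequence of discrete feature values per example
    labels   -- class label per example
    features -- indices of the candidate feature subset
    A value of 0.0 means the subset determines the label on this dataset.
    """
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        pattern = tuple(row[j] for j in features)
        groups[pattern][label] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return clashes / len(rows)


if __name__ == "__main__":
    # XOR-style data: neither feature alone determines the label.
    data = [(0, 0), (0, 1), (1, 0), (1, 1)]
    target = [0, 1, 1, 0]
    print(inconsistency_rate(data, target, [0, 1]))
    print(inconsistency_rate(data, target, [0]))
```

The XOR example shows why such measures are evaluated on feature subsets instead of single features: both features together are fully consistent, while either one alone is not.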
Softening Discrete Relaxation
Finch, Andrew M., Wilson, Richard C., Hancock, Edwin R.
This paper describes a new framework for relational graph matching. The starting point is a recently reported Bayesian consistency measure which gauges structural differences using Hamming distance. The main contributions of the work are threefold. Firstly, we demonstrate how the discrete components of the cost function can be softened. The second contribution is to show how the softened cost function can be used to locate matches using continuous nonlinear optimisation. Finally, we show how the resulting graph matching algorithm relates to the standard quadratic assignment problem. 1 Introduction Graph matching [6, 5, 7, 2, 3, 12, 11] is a topic of central importance in pattern perception. The main computational issues are how to compare inexact relational descriptions [7] and how to search efficiently for the best match [8]. These two issues have recently stimulated interest in the connectionist literature [9, 6, 5, 10]. For instance, Simic [9], Suganathan et al. [10] and Gold et al.
Hyperparameters, Evidence and Generalisation for an Unrealisable Rule
Using a statistical mechanical formalism we calculate the evidence, generalisation error and consistency measure for a linear perceptron trained and tested on a set of examples generated by a nonlinear teacher. The teacher is said to be unrealisable because the student can never model it without error. Our model allows us to interpolate between the known case of a linear teacher, and an unrealisable, nonlinear teacher. A comparison of the hyperparameters which maximise the evidence with those that optimise the performance measures reveals that, in the nonlinear case, the evidence procedure is a misleading guide to optimising performance. Finally, we explore the extent to which the evidence procedure is unreliable and find that, despite being sub-optimal, in some circumstances it might be a useful method for fixing the hyperparameters. 1 INTRODUCTION The analysis of supervised learning or learning from examples is a major field of research within neural networks.